server: continuous performance monitoring and PR comment #6283

Merged: phymbert merged 13 commits into master from hp/server/bench/workflow on Mar 27, 2024

Conversation

@phymbert (Collaborator) commented Mar 24, 2024

Motivation

In the context of:

Starts the existing k6 script benchmark on the Azure node using:

Then adds a PR comment; example:

[example chart: prompt_tokens_seconds]

Attaches results, images, and logs as job artefacts:

Sets the commit status with minimized JSON results for later reprocessing.
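As a rough sketch of that step, the commit status can be set via the GitHub REST statuses endpoint; the context name and JSON payload below are illustrative, not the workflow's actual values:

curl -X POST \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/ggerganov/llama.cpp/statuses/$COMMIT_SHA \
  -d '{"state":"success","context":"bench-server-baseline","description":"{\"pp_avg\":237.6,\"tg_avg\":97.3}"}'

The description field is size-limited, which is why the results JSON has to be minimized.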

Tested in:

TODO:

@phymbert phymbert requested a review from ngxson March 25, 2024 20:17
@phymbert phymbert marked this pull request as ready for review March 25, 2024 20:17
@phymbert phymbert requested a review from ggerganov March 25, 2024 20:17
@phymbert phymbert added labels: performance (Speed related topics), server/webui, need feedback (Testing and feedback with results are needed) on Mar 25, 2024
@phymbert phymbert changed the title from "server: bench: init" to "server: continuous performance monitoring and PR comment" on Mar 25, 2024
Comment on lines +18 to +21
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
pull_request:
types: [opened, synchronize, reopened]
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
Collaborator:

How about excluding examples/ subdirectories except for examples/server? It could help reduce unneeded runs

Collaborator Author:

Let's do it in another PR if you don't mind.
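For reference, GitHub Actions path filters are evaluated in order and support negation, so the exclusion suggested above could be sketched as follows (not the merged config):

paths:
  - '.github/workflows/bench.yml'
  - '**/*.c'
  - '**/*.cpp'
  - '!examples/**'
  - 'examples/server/**'

A later pattern overrides an earlier one, so 'examples/server/**' re-includes the server example after '!examples/**' filters the rest out.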

Comment on lines +22 to +23
schedule:
  - cron: '04 2 * * *' # every day at 02:04 UTC
Collaborator:

Do we need this scheduled run? If so, how will we view the results?

Collaborator Author:

At the moment, it will perform the steps not related to a PR: setting the commit status and uploading artefacts. I will later process all commit check statuses to show performance improvements day after day.

Collaborator:

That sounds awesome 👍 It would also be cool to pile up the daily performance results somewhere and visualize the improvement over time.
Also, if the scheduled run's role diverges too much from the PR-based runs, consider making it a separate workflow.

Collaborator Author:

Yes, I want to do something like this, probably stored on GH Pages:

https://home.apache.org/~mikemccand/lucenebench/indexing.html

But it will require a little time and logic to reprocess previous commits, taking into account that parameters have changed :/

Owner:

I don't think we should put much effort into reprocessing previous commits. Better to focus just on the new versions from now on.

@phymbert (Collaborator Author):

@ggerganov Hi Georgi, I think it is OK for a first version; please review the changes.

@ggerganov (Owner):

Cannot create a token with organization_self_hosted_runners:write since llama.cpp is not part of an organization

@phymbert (Collaborator Author) commented Mar 26, 2024:

Cannot create a token with organization_self_hosted_runners:write since llama.cpp is not part of an organization

@ggerganov I see; just creating a classic token, as I did on the fork, will work. Also please add these scopes: workflow, write:discussion, repo:status, repo_deployment and public_repo.


Also, it's probably better if you start the github runner manager yourself on the Azure T4 node:

git clone https://github.com/ggml-org/ci.git
cd ci
git remote add phymbert https://github.com/phymbert/ci.git
git fetch phymbert
git checkout hp/github-runner
# args: <repo> <github token> <runner label>
./start-github-runner-manager.sh ggerganov/llama.cpp $TOKEN Standard_NC4as_T4_v3

@ggerganov (Owner):

A classic token with workflows requires full access to the repo section, so it's not an option.

[screenshot: classic token scope selection]

I tried to make a fine-grained token with the following config:

[screenshot: fine-grained token configuration]

But I get an error:

ggml-github-runners-manager
ggml-ci: starting github runner manager on repo=ggerganov/llama.cpp label=Standard_NC4as_T4_v3...
ggml-ci: github runner manager started.
ggml-ci: github runner manager logs:
         CTRL+C to stop logs pulling
ggml-ci: fetching workflows of ggerganov/llama.cpp ...
ggml-ci:     ggml-runner-90970032-23091437579-pull_request-1711446732 triggered for workflow_name=Benchmark
invalid JIT response code: 403
    {"message":"Resource not accessible by personal access token","documentation_url":"https://docs.github.com/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository"}

@phymbert (Collaborator Author) commented Mar 26, 2024:

A classic token with workflows requires full access to the repo section, so it's not an option.
invalid JIT response code: 403
{"message":"Resource not accessible by personal access token","documentation_url":"https://docs.github.com/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository"}

Yes, I did not manage it with a fine-grained token either. Up to you; I do not see another option.

@ngxson (Collaborator) left a comment:

LGTM, only minor changes needed.

.github/workflows/bench.yml (review thread resolved)
wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xzf prometheus*.tar.gz --strip-components=1
./prometheus --config.file=examples/server/bench/prometheus.yml &
while ! nc -z localhost 9090; do
Collaborator:

Maybe we should add a timeout here, just in case something goes wrong

Collaborator Author:

The workflow will be killed after a while anyway. If you don't mind, it can be added later on.

Collaborator:

Yeah, it's not very important, but I still prefer not to rely on the CI timeout because it can be long (usually minutes or hours); we should add a timeout of, say, 10 seconds here.
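A minimal sketch of such a bounded wait, with an illustrative 10-second budget (not the merged code):

# wait for Prometheus to listen on :9090, but give up after ~10 seconds
for _ in $(seq 1 10); do
    nc -z localhost 9090 && break
    sleep 1
done
nc -z localhost 9090 || { echo "prometheus did not start in time" >&2; exit 1; }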

@phymbert (Collaborator Author) commented Mar 26, 2024:

A classic token with workflows requires full access to the repo section, so it's not an option.

@ggerganov Please note that the github default runner we are currently using has a global token for the whole repo (GITHUB_TOKEN), so adding another self-hosted runner with the same privilege does not hurt.
Also, what are the risks? I see:

  • people who have access to the VM can see the token
  • people changing a workflow executed on this runner's labels can do any action on the repo. But it would have to be someone allowed to trigger a workflow run, so someone with write access.

We can add some checks in the runner manager to verify on which branch the workflow will run, which author, and a minimum approval count.

Am I missing something? Tell me if you want me to implement some security changes. I agree it might still be a work in progress.

@ngxson if you have a better idea, it is very welcome.

EDIT: the token is only used in the manager to get a jitconfig; the runner will use the GITHUB_TOKEN secret as all other runners do. So feel free to remove my ssh public key.

@ngxson (Collaborator) commented Mar 26, 2024:

The github token used by start-github-runner-manager.sh is only used to generate the JIT config, as you said, so it's only visible to the manager container, not to the ephemeral runners created by the manager. In short, developers who trigger a new workflow cannot read the token.

The /actions/runners/generate-jitconfig endpoint requires administration:write as described in this page, which should be equivalent to "Administration" in a fine-grained token. You need to add at least one repo to the token in order to see this list (I couldn't test it, so I'm not sure if it works):

[screenshot: fine-grained token permission list]

But anyway this permission is still very powerful, because it allows changing the list of collaborators, the repo description, etc. So it should still be kept secured.

And yes, anyone having access to the VM can see the token, but I don't think it's a problem. The GitLab runner works the same way (the token needs to be hard-written inside the VM). Also, for now only you and @ggerganov have access to the VM, so there's no problem.
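For reference, a sketch of the JIT-config call the manager makes, following the REST endpoint linked above (the runner name is illustrative; runner_group_id 1 is the default group):

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/ggerganov/llama.cpp/actions/runners/generate-jitconfig \
  -d '{"name":"ephemeral-bench-runner","runner_group_id":1,"labels":["Standard_NC4as_T4_v3"]}'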

@ggerganov (Owner):

The required token is too powerful - it can be used to delete the repo at the very least. So I'm hesitant to give access to it

It seems that the only option that we have atm is to limit access to the node just to myself and start the self-hosted manager. Will try to do this today or tomorrow

@phymbert (Collaborator Author):

The required token is too powerful - it can be used to delete the repo at the very least. So I'm hesitant to give access to it

It seems that the only option that we have atm is to limit access to the node just to myself and start the self-hosted manager. Will try to do this today or tomorrow

Thanks Georgi, I am definitely all in for the least-privilege approach; especially with CI automation, human mistakes always happen.

If you have time, could you simply delete the work VM and create a fresh one,
in order to verify that all the installation scripts are working in:

I am just imagining another option:

  1. fork and sync llama.cpp in ggml-org
  2. start the runner in ggml-org with an appropriate fine-grained token
  3. in the workflow, change the commit status target to ggerganov/llama.cpp

Would it work? Less convenient, but more secure.

@ngxson (Collaborator) commented Mar 26, 2024:

start the runner in ggml-org with appropriate fined grained token

Yeah, that seems like a good idea. Just a reminder that you can change the repository in the actions/checkout@v3 step, so that the runner pulls code directly from llama.cpp (it's a public repo anyway, so anyone can git clone it); no need to sync it into ggml-org.
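A minimal sketch of that step, assuming the workflow lives in a ggml-org repository (the ref value is illustrative):

- name: Checkout llama.cpp instead of the hosting repo
  uses: actions/checkout@v3
  with:
    repository: ggerganov/llama.cpp
    ref: master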

@phymbert (Collaborator Author):

start the runner in ggml-org with appropriate fined grained token

Yeah seems like a good idea. Just remind that you can use change the repository in step actions/checkout@v3, so that the runner pull code directly from llama.cpp (it's a public repo anyway, so anyone can do git clone), no need to sync it in ggml-org.

The problem is how the workflow would be triggered this way. I need to think a bit more about whether it's possible to schedule the workflow on another repo without a git sync.

@ngxson (Collaborator) commented Mar 26, 2024:

My idea is: from llama.cpp, we can send a request to ggml-org to tell it to trigger the pipeline. Imagine it's a bit like our "Publish Docker image" step that makes a call to the registry outside of the runner.

This approach requires llama.cpp to keep a ggml-org token as a secret, but the token only needs the actions:write permission, so it shouldn't be a big problem. An example can be found here
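One way to sketch that cross-repo trigger is the repository_dispatch REST endpoint; the target repo, event type, and payload here are hypothetical:

curl -X POST \
  -H "Authorization: Bearer $GGML_ORG_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/ggml-org/bench/dispatches \
  -d '{"event_type":"bench-request","client_payload":{"pr":6283}}'

The receiving workflow would then declare on: repository_dispatch with types: [bench-request] to pick it up.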

github-actions bot (Contributor) commented Mar 27, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 522 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8956.63ms p(90)=25933.28ms fails=0, finish reason: stop=522 truncated=0
  • Prompt processing (pp): avg=237.61tk/s p(90)=698.3tk/s total=203.5tk/s
  • Token generation (tg): avg=97.26tk/s p(90)=257.93tk/s total=129.19tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/bench/workflow commit=4a6bfa92c5cfa12efa264c4c145dd91e6c8aba60
Time series

prompt_tokens_seconds

[mermaid xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 522 iterations": llamacpp:prompt_tokens_seconds over 1711566053 --> 1711566683]
predicted_tokens_seconds

[mermaid xychart, same bench: llamacpp:predicted_tokens_seconds over 1711566053 --> 1711566683]

Details

kv_cache_usage_ratio

[mermaid xychart, same bench: llamacpp:kv_cache_usage_ratio over 1711566053 --> 1711566683]
requests_processing

[mermaid xychart, same bench: llamacpp:requests_processing over 1711566053 --> 1711566683]

@phymbert (Collaborator Author):

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 482 iterations 🚀

  • Concurrent users: 8
  • HTTP request : avg=9739.4ms p(90)=26438.48ms passes=482reqs fails=0reqs
  • Prompt processing (pp): avg=245.09tk/s p(90)=741.1tk/s total=193.57tk/s
  • Token generation (tg): avg=99.01tk/s p(90)=278.66tk/s total=129.61tk/s
  • Finish reason : stop=482reqs truncated=0
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/bench/workflow commit=1c1f8769947ef6e483809beec87b59051cf3e435

@ggerganov please review the added comment and I think we are good

@phymbert (Collaborator Author):

@ggerganov please review the added comment and I think we are good

It looks like we have our baseline ^^ but I am wondering why this VM is slower than before:

Is it the same CPU/RAM/T4?

@ggerganov (Owner):

Yes, it's Standard_NC4as_T4_v3 as before. Not sure why there is a difference.

Btw, I'm not super confident about the PR comments - it might get annoying to have those on every PR. For now let's put everything after the bullet points in <details> so they are more compact, and we will see if we want to keep them depending on how useful/distracting they are.

@phymbert (Collaborator Author):

Yes, it's Standard_NC4as_T4_v3 as before. Not sure why there is a difference.

Btw, I'm not super confident about the PR comments - it might get annoying to have those on every PR. For now let's put everything after the bullet points in <details> so they are more compact, and we will see if we want to keep them depending on how useful/distracting they are.

Done; note there is only one comment per PR.

@phymbert phymbert merged commit a016026 into master Mar 27, 2024
27 of 28 checks passed
@phymbert phymbert deleted the hp/server/bench/workflow branch March 27, 2024 19:26
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
server: continuous performance monitoring and PR comment (#6283)

* server: bench: init

* server: bench: reduce list of GPU nodes

* server: bench: fix graph, fix output artifact

* ci: bench: add mermaid in case of image cannot be uploaded

* ci: bench: more resilient, more metrics

* ci: bench: trigger build

* ci: bench: fix duration

* ci: bench: fix typo

* ci: bench: fix mermaid values, markdown generated

* typo on the step name

Co-authored-by: Xuan Son Nguyen <[email protected]>

* ci: bench: trailing spaces

* ci: bench: move images in a details section

* ci: bench: reduce bullet point size

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
server: continuous performance monitoring and PR comment (#6283)
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
server: continuous performance monitoring and PR comment (#6283)
Labels: need feedback (Testing and feedback with results are needed), performance (Speed related topics), server/webui

4 participants